Alberto Munguia (am5334)
Bernardo Lopez (bl2786)
Ivan Ugalde (du2160)
The code for this analysis is published in a public GitHub repository.
Friends is an American situation comedy, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. It featured an ensemble cast starring Jennifer Aniston (Rachel), Courteney Cox (Monica), Lisa Kudrow (Phoebe), Matt LeBlanc (Joey), Matthew Perry (Chandler), and David Schwimmer (Ross).
The show revolved around six friends in their 20s and 30s who lived in Manhattan, New York City. Rachel Green, a sheltered but friendly woman, flees her wedding day and her rich yet unfulfilling life, and finds childhood friend Monica Geller, a tightly-wound but caring chef. After Rachel becomes a waitress at coffee house Central Perk, she and Monica become roommates at Monica’s apartment located directly above Central Perk, and Rachel joins Monica’s group of single people in their mid-20s: her previous roommate Phoebe Buffay, an eccentric, innocent masseuse; her neighbor across the hall Joey Tribbiani, a dim-witted yet loyal struggling actor and womanizer; Joey’s roommate Chandler Bing, a sarcastic, self-deprecating IT manager; and her older brother and Chandler’s college roommate Ross Geller, a sweet-natured but insecure paleontologist.
Friends received positive reviews throughout its run and became one of the most popular sitcoms of its time. The series won many awards and was nominated for 63 Primetime Emmy Awards. The series was also very successful in the ratings, consistently ranking in the top ten of the final primetime ratings. Friends has made a large cultural impact and has become the model to follow for sitcoms.
As teenagers at the beginning of the century, we were heavily influenced by the Friends phenomenon and became huge fans of the sitcom. We decided to work on this project to challenge our preconceptions of the show through data analysis and to discover hidden insights. The questions that guide our quantitative assessment are the following:
Can we categorize all the appearing characters of the sitcom by importance? At first glance this question could seem simple, but under the assumption that we possess no previous knowledge of the sitcom, and considering that more than 800 characters appeared over the ten seasons, the analysis represents a challenge.
Can we identify and quantify the interactions between the main and secondary characters? What would be an appropriate way to quantify and visualize these relationships?
Which are the most recurrent topics through the seasons and episodes of the show? How did the themes of the show evolve over its ten seasons? Can we extract this information from the dialogues of the show?
Can we determine the contribution of each character to the popularity of the sitcom? Does the participation of each character influence the viewer’s preferences?
We used the following R libraries for the development of this project:
For data extraction and manipulation: dplyr, rvest, robotstxt, base, Hmisc
For data visualization: ggplot2, ggthemes, ggrepel, visNetwork, d3, knitr
Furthermore, to run some of the analyses and visualizations we used Machine Learning (ML) techniques and other statistical tools: k-means analysis (cluster), graph and network analysis (igraph), and topic analysis (textmineR, stopwords).
The primary data sources that we used for our project, and which we consider to be of adequate quality, are:
Transcripts: For the transcripts, we used an open resource built by fans of the sitcom and compiled in a GitHub repository. The repository contains all the dialogues of the characters for the 231 episodes of the TV show. The data is organized in HTML documents. The data can be accessed via: https://fangj.github.io/friends/. If you want to see how the transcripts are originally presented, please click here.
Ratings: For the ratings, we used the IMDb Datasets, which are available to customers for personal and non-commercial use. The data is structured in seven compressed CSV files that contain general information about the show (genre, start year, end year, episode duration, etc.) and specific information about each episode (title, rating, characters, crew, etc.). A relevant characteristic of the database is that it is refreshed daily; we retrieved the data on November 10, 2019. The datasets can be accessed via: https://datasets.imdbws.com/
IMDb Dataset:
The first obstacle that we faced with the IMDb datasets was their size, some with millions of rows of information on TV shows, shorts, movies, documentaries, and other entertainment formats. Due to their size, it was not possible to store them on GitHub.
The second obstacle was to identify the data corresponding to our case study. For example, searching the dataset by the name ‘Friends’ alone returned 178 TV shows and movies called ‘Friends’. It was necessary to research the start and end years of the series to refine the search.
Another obstacle was that the ID for TV series was not uniform across the seven IMDb datasets. For example, in the dataset of TV series titles, the ID that identifies the show is named “tconst”, while in the dataset from which we obtain the episode IDs, “tconst” refers to the episode and the ID of the TV series is called “parentTconst”. These inconsistencies were identified through exploration of the datasets.
Transcript Dataset:
The main obstacle with the dialogue dataset is that not all the HTML files share the same format. We overcame this difficulty by handling, in our scraping code, the special cases that we detected.
The second difficulty with the dialogue dataset was cleaning the dialogues. We tried to standardize the content of the dialogues as much as possible by identifying different names for the same character, common typos, and regular expressions that could hinder our analysis.
You can follow the scraping code that leads to the following data frame in the "EVAD_friends.Rmd" file in the GitHub repository.
url <- "https://fangj.github.io/friends/"
paths_allowed(url)
## # A tibble: 6 x 5
## episode_id line_num scene character line
## <chr> <dbl> <dbl> <chr> <chr>
## 1 1 : 01 1 1 MONICA There's nothing to tell! He's just s…
## 2 1 : 01 2 1 JOEY C'mon, you're going out with the guy…
## 3 1 : 01 3 1 CHANDLER All right Joey, be nice. So does he …
## 4 1 : 01 4 1 PHOEBE Wait, does he eat chalk?
## 5 1 : 01 5 1 PHOEBE Just, 'cause, I don't want her to go…
## 6 1 : 01 6 1 MONICA Okay, everybody relax. This is not e…
## # A tibble: 6 x 2
## line words
## <chr> <int>
## 1 There's nothing to tell! He's just some guy I work with! 11
## 2 C'mon, you're going out with the guy! There's gotta be something w… 14
## 3 All right Joey, be nice. So does he have a hump? A hump and a hair… 16
## 4 Wait, does he eat chalk? 5
## 5 Just, 'cause, I don't want her to go through what I went through w… 16
## 6 Okay, everybody relax. This is not even a date. It's just two peop… 21
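The per-line parsing behind the data frame above can be sketched as follows. This is a minimal illustration, not the project's actual scraping code (which lives in "EVAD_friends.Rmd"); the helper name is ours, and it assumes a raw line of the form "Character: dialogue".

```r
# Sketch: extract speaker and dialogue from one raw transcript line.
# Stage directions in parentheses are removed, as the scraping code does.
parse_line <- function(raw) {
  raw <- gsub("\\([^)]*\\)", "", raw)                      # drop stage directions
  m <- regmatches(raw, regexec("^([^:]+):\\s*(.*)$", raw))[[1]]
  if (length(m) < 3) return(NULL)                          # not a dialogue line
  list(character = toupper(trimws(m[2])), line = trimws(m[3]))
}

parse_line("Monica: There's nothing to tell!")
# $character is "MONICA", $line is "There's nothing to tell!"
```

Note that a line such as "Paolo: (something in Italian)" yields an empty dialogue string, which is the source of the missing values discussed later.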
We can see that some episodes were put together in the same file:
## [1] "2 : 12-13" "6 : 15-16" "9 : 23-24" "10 : 17-18"
We split those episodes into two separate ones and, for more clarity, added season and episode columns.
Then we corrected some character names that had typos and removed some lines caught by the scraping code that are not dialogue.
## # A tibble: 6 x 8
## episode_id line_num scene character line words season episode
## <chr> <dbl> <dbl> <chr> <chr> <int> <int> <int>
## 1 1 : 01 1 1 MONICA There's nothing… 11 1 1
## 2 1 : 01 2 1 JOEY C'mon, you're g… 14 1 1
## 3 1 : 01 3 1 CHANDLER All right Joey,… 16 1 1
## 4 1 : 01 4 1 PHOEBE Wait, does he e… 5 1 1
## 5 1 : 01 5 1 PHOEBE Just, 'cause, I… 16 1 1
## 6 1 : 01 6 1 MONICA Okay, everybody… 21 1 1
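The episode split can be sketched in base R as follows (the helper name is ours, for illustration):

```r
# Sketch: split a combined episode id such as "2 : 12-13" into
# "2 : 12" and "2 : 13"; single-episode ids pass through unchanged.
split_episode_id <- function(id) {
  parts <- strsplit(id, " : ", fixed = TRUE)[[1]]     # season, episode part
  eps   <- strsplit(parts[2], "-", fixed = TRUE)[[1]] # one or two episodes
  paste(parts[1], eps, sep = " : ")
}

split_episode_id("2 : 12-13")  # "2 : 12" "2 : 13"
split_episode_id("1 : 01")     # "1 : 01"
```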
Further data transformations are used and explained in each of the Results subsections.
Step 1.2: We extracted, decompressed, and saved the IMDb tables as data frames. The three tables that we used in our analysis are those that allowed us to extract the rating information for each episode.
title.ratings.tsv.gz
## 'data.frame': 990485 obs. of 3 variables:
## $ tconst : Factor w/ 990485 levels "tt0000001","tt0000002",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ averageRating: num 5.6 6.1 6.5 6.2 6.1 5.2 5.5 5.4 5.4 6.9 ...
## $ numVotes : int 1547 187 1204 114 1932 102 615 1663 81 5539 ...
title.episode.tsv.gz
## 'data.frame': 4425501 obs. of 4 variables:
## $ tconst : Factor w/ 4425501 levels "tt0041951","tt0042816",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ parentTconst : Factor w/ 128694 levels "tt0038276","tt0039122",..: 59 34810 34810 22 34810 34810 34810 34274 34810 34810 ...
## $ seasonNumber : Factor w/ 244 levels "\\N","1","10",..: 2 2 1 131 103 103 131 2 131 179 ...
## $ episodeNumber: Factor w/ 15556 levels "\\N","0","1",..: 14445 6320 1 9103 6209 13326 7769 11103 9547 1115 ...
## 'data.frame': 236 obs. of 9 variables:
## $ parentTconst : chr "tt0108778" "tt0108778" "tt0108778" "tt0108778" ...
## $ titleType : Factor w/ 1 level "tvSeries": 1 1 1 1 1 1 1 1 1 1 ...
## $ primaryTitle : Factor w/ 1 level "Friends": 1 1 1 1 1 1 1 1 1 1 ...
## $ tconst : chr "tt0583431" "tt0583432" "tt0583433" "tt0583434" ...
## $ seasonNumber : Factor w/ 244 levels "\\N","1","10",..: 212 3 3 3 223 3 190 201 103 190 ...
## $ episodeNumber: Factor w/ 15556 levels "\\N","0","1",..: 13326 14445 6320 6431 3 3 3 3 2226 7769 ...
## $ averageRating: num 8.2 8.6 9.5 9.7 8.7 8.5 8.9 8.7 8.6 8.8 ...
## $ numVotes : int 2568 2641 5829 9699 2783 2889 3376 2962 3472 3100 ...
## $ episode_id : chr "7 : 08" "10 : 09" "10 : 17" "10 : 18" ...
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 61264 obs. of 12 variables:
## $ episode_id : chr "1 : 01" "1 : 01" "1 : 01" "1 : 01" ...
## $ line_num : num 1 2 3 4 5 6 7 8 9 10 ...
## $ scene : num 1 1 1 1 1 1 1 2 2 2 ...
## $ character : chr "MONICA" "JOEY" "CHANDLER" "PHOEBE" ...
## $ line : chr "There's nothing to tell! He's just some guy I work with!" "C'mon, you're going out with the guy! There's gotta be something wrong with him!" "All right Joey, be nice. So does he have a hump? A hump and a hairpiece?" "Wait, does he eat chalk?" ...
## $ words : int 11 14 16 5 16 21 6 22 5 11 ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ episode : int 1 1 1 1 1 1 1 1 1 1 ...
## $ parentTconst : chr "tt0108778" "tt0108778" "tt0108778" "tt0108778" ...
## $ tconst : chr "tt0583459" "tt0583459" "tt0583459" "tt0583459" ...
## $ averageRating: num 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 ...
## $ numVotes : int 6098 6098 6098 6098 6098 6098 6098 6098 6098 6098 ...
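The extraction and join steps above can be sketched as follows. The file names are the IMDb originals, and tt0108778 is the parentTconst of Friends, as seen in the output; the function names are ours, for illustration.

```r
# Sketch, assuming the IMDb .tsv.gz files have been downloaded locally.
# read.delim() decompresses .gz files transparently; "\\N" marks
# missing values in the IMDb exports.
read_imdb <- function(path) read.delim(path, na.strings = "\\N")

# Keep only this show's episodes and attach each episode's rating;
# tt0108778 is the parentTconst that identifies Friends.
friends_ratings <- function(episodes, ratings) {
  merge(episodes[episodes$parentTconst == "tt0108778", ],
        ratings, by = "tconst")
}
```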
To search for missing values, we look at the number of missing values per column in the dialogues data frame.
## episode_id line_num scene character line
## 0 0 0 0 61
## words season episode parentTconst tconst
## 0 0 0 0 0
## averageRating numVotes
## 0 0
There appear to be some missing values in the line column. We will use visna from the extracat library to see the pattern of missing values.
These missing values are due to differing formats in the GitHub pages used for web scraping. For example, lines like “Paolo: (something in Italian)” render an NA line because the scraping code removes everything between parentheses.
We decided to fill those missing values with "". By doing so we keep the record of those characters’ dialogue.
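A minimal sketch of that fill step, assuming the combined data frame is named dialogues (the helper name is ours):

```r
# Replace missing dialogue text with an empty string so the record of
# the character's appearance in the scene is preserved.
fill_missing_lines <- function(dialogues) {
  dialogues$line[is.na(dialogues$line)] <- ""
  dialogues
}
```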
To answer this question we used the unsupervised Machine Learning technique of k-means clustering. Its objective is to label the data based on certain characteristics; in this case, we used the number of words, lines, and scenes. To accomplish this task we used the cluster and base libraries. Moreover, we established a priori the desired number of labels for our data: for practical purposes we set the number of groups to k = 3.
From the k-means analysis we obtain the following separation of characters:

* Main Characters: As expected, Rachel, Monica, Phoebe, Joey, Chandler, and Ross constitute one group that on average has 1,680 scenes, 8,469 lines, and 87,783 words per character.
* Secondary Characters: This group is composed of 33 characters, most of them recurring characters and guest stars. The average character in this group has 36 scenes, 136 lines, and 1,252 words.
Centers:
## Total_scene Total_lines Total_words vcluster
## 1 1679.833333 8469.33333 87783.33333 3
## 2 36.343750 135.81250 1251.81250 1
## 3 2.487365 7.34296 65.11432 2
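The clustering above can be sketched with base R's kmeans(). The column names follow the centers table; the totals data frame (one row per character) and the helper name are assumptions for illustration.

```r
# Sketch: label characters by importance with k-means, k = 3,
# using per-character totals of scenes, lines and words.
cluster_characters <- function(totals, k = 3) {
  set.seed(42)   # k-means starts from random centers
  km <- kmeans(totals[, c("Total_scene", "Total_lines", "Total_words")],
               centers = k, nstart = 25)
  totals$cluster <- km$cluster
  totals
}
```

nstart = 25 reruns the algorithm from several random starts and keeps the best solution, which makes the grouping stable.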
Friends is a TV show that tells the story of a group of six friends: Monica, Rachel, Phoebe, Chandler, Ross, and Joey. Is one of these characters more important than the others? We try to answer this question by looking at the number of lines for each of the main characters.
We can see that Rachel is the character with the most lines and Phoebe the one with the fewest. Now we focus on the number of words instead of the number of lines.
Rachel and Ross are again the characters who speak the most, and Phoebe the one with the fewest words. We can see that Monica is third by number of lines but fifth by number of words, which suggests that Monica’s lines tend to be shorter. The opposite happens with Joey: he is fifth by number of lines but third by number of words, which suggests his lines tend to be longer.
By looking into the lines-per-episode distribution we find the following:

* Monica’s distribution looks narrower than the others. This indicates that there are few episodes in which Monica speaks a lot.
* Chandler and Ross have long right tails; we infer that those characters have episodes in which they speak a lot.
* Rachel and Ross have wider distributions.
For the network analysis, a special data structure is required. We defined an interaction between characters as their sharing the same scene. We must mention that the original structure of the dialogue data does not permit us to identify the exact interactions of the characters within a scene; hence, we assumed that all the characters appearing in a scene interacted with one another. Moreover, we represented the interactions between the characters by an adjacency matrix, in which we can observe the number of interactions each character has with the others.
With the igraph library we created the adjacency matrix of the characters and quantified the interactions among the 869 characters. For the 6 main characters the adjacency matrix looks like this:
## CHANDLER JOEY RACHEL ROSS MONICA PHOEBE
## CHANDLER 0 991 697 790 1051 729
## JOEY 991 0 747 764 758 746
## RACHEL 697 747 0 949 847 827
## ROSS 790 764 949 0 733 689
## MONICA 1051 758 847 733 0 875
## PHOEBE 729 746 827 689 875 0
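The report builds this matrix with igraph; under the shared-scene assumption described above, an equivalent base-R sketch (helper name ours) is:

```r
# Sketch: count, for every pair of characters, the number of scenes
# they share ('dialogues' is assumed to have one row per spoken line).
co_appearances <- function(dialogues) {
  scene_id <- paste(dialogues$episode_id, dialogues$scene)
  chars    <- sort(unique(dialogues$character))
  # character x scene incidence matrix: 1 if the character appears
  incidence <- vapply(unique(scene_id), function(s) {
    as.integer(chars %in% dialogues$character[scene_id == s])
  }, integer(length(chars)))
  rownames(incidence) <- chars
  adjacency <- tcrossprod(incidence)   # shared-scene counts per pair
  diag(adjacency) <- 0                 # no self-interactions
  adjacency
}
```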
We can also visualize interactively the relationships between the main and the secondary characters. The width of the edges represents the level of interaction among the characters; as you may perceive, the interaction between the main characters is very balanced. Click and drag the vertices, or select the groups or the characters.
For topic modelling we will use the package textmineR. We will try to find the topic for each episode. To do so we will create a document for each episode, so we have to group lines by episode_id.
## # A tibble: 6 x 2
## episode_id lines
## <chr> <chr>
## 1 1 : 01 There's nothing to tell! He's just some guy I work with! C'mo…
## 2 1 : 02 What you guys don't understand is, for us, kissing is as impo…
## 3 1 : 03 Hi guys! Hey, Pheebs! Hi! Hey. Oh, oh, how'd it go? Um, not s…
## 4 1 : 04 "Alright. Phoebe? Okay, okay. If I were omnipotent for a day,…
## 5 1 : 05 "Would you let it go? It's not that big a deal. Not that big …
## 6 1 : 06 Ooh! Look! Look! Look! Look, there's Joey's picture! This is …
The function CreateDtm creates a document term matrix. To do so we use a set of stopwords: words we do not want to use because they occur frequently in the English language and do not give insightful information.
We will use the document term matrix to create a term-document frequency matrix that counts the number of times a term appears (term frequency) and the number of documents in which a term appears (document frequency).
These are the main terms ordered by term frequency:
## term term_freq doc_freq
## 11509 good 1714 231
## 11508 god 1677 228
## 11507 guys 1468 225
## 11506 great 1342 225
## 11505 time 1215 229
## 11504 back 1125 223
Now we fit a Latent Dirichlet Allocation (LDA) model in which we try to fit 15 topics to the collection of episodes. This returns two main matrices:
## [1] "Theta:"
## t_1 t_2 t_3 t_4 t_5
## 1 : 01 0.0222650602 0.0511807229 9.638554e-05 0.056000000 0.1003373
## 1 : 02 0.0001471670 0.0413539367 1.044886e-02 0.000147167 0.1428992
## 1 : 03 0.0001581028 0.0033201581 3.320158e-03 0.011225296 0.2467984
## 1 : 04 0.0080848244 0.2042412194 1.325381e-04 0.046520875 0.2320742
## 1 : 05 0.0015907448 0.0001446132 1.446132e-04 0.014605929 0.1404194
## 1 : 06 0.0253145818 0.0238341969 1.643227e-02 0.001628423 0.2192450
## [1] "Phi:"
## utah gauze_gauze speaks buddy_boy pac_man
## t_1 8.844157e-06 8.844157e-06 1.857273e-04 8.844157e-06 1.857273e-04
## t_2 9.577718e-06 9.577718e-06 9.577718e-06 9.577718e-06 9.577718e-06
## t_3 1.820246e-04 8.667840e-06 8.667840e-06 8.667840e-06 8.667840e-06
## t_4 9.171872e-06 9.171872e-06 9.171872e-06 9.171872e-06 9.171872e-06
## t_5 1.660498e-06 1.660498e-06 1.660498e-06 1.660498e-06 1.660498e-06
## t_6 1.089455e-05 1.089455e-05 1.089455e-05 1.089455e-05 1.089455e-05
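The pipeline that produces theta and phi can be sketched with textmineR as follows. This is a sketch: the wrapper name is ours, and the iteration count is an assumed illustrative value; only k = 15 comes from the text above.

```r
# Sketch of the topic-modelling pipeline with textmineR.
fit_episode_topics <- function(docs, k = 15) {
  dtm <- textmineR::CreateDtm(
    doc_vec      = docs$lines,                  # one document per episode
    doc_names    = docs$episode_id,
    stopword_vec = stopwords::stopwords("en"))  # drop uninformative words
  tf  <- textmineR::TermDocFreq(dtm)            # term_freq / doc_freq table
  lda <- textmineR::FitLdaModel(dtm = dtm, k = k, iterations = 500)
  list(theta = lda$theta,   # episode x topic probabilities
       phi   = lda$phi,     # topic x term probabilities
       tf    = tf)
}
```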
Now the 15 topics have been created. To assess topic quality we look at topic coherence, a measure of how associated the words in a topic are.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.0005412 0.0148679 0.0448704 0.0432286 0.0602130 0.1024139
We will use phi to get the top 5 terms per topic.
## [,1] [,2] [,3] [,4] [,5]
## t_1 "richard" "money" "love" "movie" "big"
## t_2 "monkey" "marcel" "julie" "dr" "drake"
## t_3 "mike" "job" "great" "guy" "ralph"
## t_4 "dad" "father" "stuff" "boat" "house"
## t_5 "guys" "wait" "time" "back" "make"
## t_6 "christmas" "year" "mom" "santa" "plane"
## t_7 "emma" "birthday" "party" "guys" "sister"
## t_8 "party" "job" "money" "man" "mark"
## t_9 "baby" "babies" "pregnant" "god" "father"
## t_10 "emily" "ben" "london" "great" "love"
## t_11 "wedding" "married" "love" "ring" "marry"
## t_12 "kiss" "date" "make" "people" "school"
## t_13 "thanksgiving" "god" "susan" "carol" "parents"
## t_14 "janice" "place" "move" "thing" "apartment"
## t_15 "god" "good" "great" "guy" "love"
The next step is to compute the topic prevalence using theta. Topic prevalence indicates the most frequent topics in the TV show.
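The prevalence values in the model summary sum to roughly 100, so a minimal sketch (helper name ours) is the share of theta's probability mass that each topic receives, scaled to percent:

```r
# Sketch: topic prevalence as the share of total probability mass each
# topic receives across all episodes, scaled to sum to 100.
topic_prevalence <- function(theta) {
  colSums(theta) / sum(theta) * 100
}
```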
Finally, we get a summary for the complete LDA model.
## topic coherence prevalence top_terms
## t_15 t_15 0.000 29.979 god, good, great, guy, love
## t_5 t_5 -0.001 18.761 guys, wait, time, back, make
## t_12 t_12 0.009 5.093 kiss, date, make, people, school
## t_14 t_14 0.021 4.721 janice, place, move, thing, apartment
## t_8 t_8 0.026 4.432 party, job, money, man, mark
## t_11 t_11 0.102 4.208 wedding, married, love, ring, marry
## t_4 t_4 0.061 4.020 dad, father, stuff, boat, house
## t_9 t_9 0.049 3.997 baby, babies, pregnant, god, father
## t_2 t_2 0.095 3.930 monkey, marcel, julie, dr, drake
## t_1 t_1 0.045 3.907 richard, money, love, movie, big
## t_3 t_3 0.008 3.839 mike, job, great, guy, ralph
## t_13 t_13 0.091 3.471 thanksgiving, god, susan, carol, parents
## t_7 t_7 0.050 3.445 emma, birthday, party, guys, sister
## t_10 t_10 0.059 3.100 emily, ben, london, great, love
## t_6 t_6 0.031 3.095 christmas, year, mom, santa, plane
We can see that the most prevalent (frequent) topic has words like “good”, “god”, “great”, and “time”. This makes sense: these words are very frequent in the TV show, which is why they give very little information about the topic, and why its coherence is 0.0.
The other topics in the model have less prevalence, but they are more coherent. If you are a fan of the show and read the list of top terms, we are sure you can remember episodes in which those terms were important.
To find those important episodes we created a d3 tool. We wrote a CSV file using theta in which, for each episode and topic, we stored the probability of that topic given the episode and the top terms of that topic.
## id topic value topic_num
## 1 1 : 01 t_1 2.226506e-02 1
## 2 1 : 01 t_14 4.539759e-02 14
## 3 1 : 01 t_6 8.771084e-03 6
## 4 1 : 01 t_15 3.297349e-01 15
## 5 1 : 01 t_13 3.114217e-01 13
## 6 1 : 01 t_7 9.638554e-05 7
## top_terms name
## 1 richard, money, love, movie, big Monica Gets A Roommate
## 2 janice, place, move, thing, apartment Monica Gets A Roommate
## 3 christmas, year, mom, santa, plane Monica Gets A Roommate
## 4 god, good, great, guy, love Monica Gets A Roommate
## 5 thanksgiving, god, susan, carol, parents Monica Gets A Roommate
## 6 emma, birthday, party, guys, sister Monica Gets A Roommate
We built an interactive tool where the user can search for the episodes that most relate to each one of the 15 topics. To access the tool click here.
To look for the main features that drive the rating up or down, it is useful to start by observing its temporal structure. The following graph is interactive:
Click on the season title to (de)activate each series
Next, to create a cleaner view of the data, we create a boxplot.
The boxplots not only confirm, in a cleaner graph, the behavior observed previously, but also provide additional insights:
A likely explanation for the behavior in the last 7 episodes of each season is that the writers may have prepared an intricate (and not necessarily amusing) plot to reach its climax in the final episodes.
An interesting feature to explore is how the number of viewers and voters relates to the average rating.
The scatterplot matrix above shows that the better an episode is, the likelier people are to invest their time in rating it; specifically, the relationship is exponential. In contrast, the number of viewers shows a small positive linear correlation.
This TV show is commonly perceived as having constantly increasing ratings and numbers of followers. However, we can show that this is not completely true, at least while it was aired on TV, before streaming services like Netflix became the big players they are today.
The following boxplot graph shows that the hype for the show and the main actors’ salaries were not backed by the number of people following the weekly episodes.
Clearly, the most successful season, when the actors were paid in the range of $20,000 to $40,000 per episode, was Season 2, with about 50% more viewers than Season 10, when they were paid $1 million per episode.
Continuing with the number of viewers: when the last episode of a season is considered a good one, it is expected that people will watch the beginning of the next one and then lose stamina (until the last episodes of the current season).
An additional insight: episodes 19-22 usually aired around ____________, which explains the low number of fans watching. Here is the graph.
Probably the most ubiquitous discussion among Friends fans is which of the main characters is the best. Here we define best as the character that drives the rating most strongly. To do so we will try to find whether there is a relation between the number of lines of each character and the rating of each episode, as well as the interactions among all of the characters.
It is no surprise that Monica and Chandler are the characters with the most interactions between them, since they were a couple for a longer time than Ross and Rachel (second place).
As for the rating, we can see that individual participation does not hold a high correlation with the rating of an episode. However, we should note that all the correlations are positive and that Ross has the highest one, which aligns with the previous analysis of interactions.
To test how statistically significant these results are, we ran correlation hypothesis tests. Considering that the relation will not necessarily be linear and that there are many ties in rating (due to rounding), we decided to run both a parametric (Pearson) and a non-parametric (Spearman) hypothesis test for \(H_0: \rho \leq 0\) vs \(H_1:\rho>0\), which resulted in the following p-values:
| Character | Spearman | Pearson |
|---|---|---|
| CHANDLER | 0.0218936 | 0.1678547 |
| JOEY | 0.0429544 | 0.0969200 |
| MONICA | 0.0220195 | 0.0029182 |
| PHOEBE | 0.5644974 | 0.0411303 |
| RACHEL | 0.0608283 | 0.0328136 |
| ROSS | 0.0005194 | 0.0000345 |
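Tests of this kind can be run with base R's cor.test(); this sketch (helper name ours) assumes rating and a character's lines per episode are numeric vectors aligned by episode.

```r
# Sketch: one-sided tests of H0: rho <= 0 vs H1: rho > 0 between a
# character's lines per episode and the episode rating.
rating_correlation <- function(lines_per_ep, rating) {
  c(spearman = cor.test(lines_per_ep, rating, method = "spearman",
                        alternative = "greater", exact = FALSE)$p.value,
    pearson  = cor.test(lines_per_ep, rating, method = "pearson",
                        alternative = "greater")$p.value)
}
```

exact = FALSE asks for the asymptotic Spearman p-value, which avoids problems with the many ties in rounded ratings.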
Ross seems to be the most relevant character, given the number of lines and scenes in which he participates, the fact that he has the highest correlation with the rating, and that his correlation is significantly greater than zero under both hypothesis tests.
We included four elements that are interactive:
This project turned out to be a complex challenge from a data-management point of view, since we had to move from semi-structured data such as HTML files to structured databases that could be exploited for the purposes of the project.
Additionally, we believe that asking complex questions about a topic as popular as Friends was a challenge. That is why we decided to use unsupervised machine learning techniques, which allowed us to perform deep and objective analyses under the premise of complete ignorance of the series, of which we were a priori fans.
Finally, we believe that the data to which we have access has the potential to allow more complete and complex analyses. For example, we would have liked to have had more time and resources to carry out a deeper analysis of the temporal interactions of the characters from a graph theory perspective.